Abstract


Given a low-rank matrix $M$, we consider the problem of reconstructing it from noisy observations
of a small, random subset of its entries. The problem arises in a variety of applications, from
collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study
a low-complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a
combination of spectral techniques and manifold optimization, which we call here OPTSPACE. We prove
performance guarantees that are order-optimal in a number of circumstances.

Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization 


1. Introduction 


Spectral techniques are an authentic workhorse in machine learning, statistics, numerical analysis,
and signal processing. Given a matrix $M$, its largest singular values, and the associated singular
vectors, ‘explain’ the most significant correlations in the underlying data source. A low-rank
approximation of $M$ can further be used for low-complexity implementations of a number of linear
algebra algorithms (Frieze et al., 2004).

In many practical circumstances we have access only to a sparse subset of the entries of an
$m \times n$ matrix $M$. It has recently been discovered that, if the matrix $M$ has rank $r$, and
unless it is too ‘structured’, a small random subset of its entries suffices to reconstruct it
exactly. This result was first proved by Candès and Recht (2008) by analyzing a convex relaxation
introduced by Fazel (2002). A tighter analysis of the same convex relaxation was carried out by
Candès and Tao (2009). A number of iterative schemes to solve the convex optimization problem
appeared soon thereafter (Cai et al., 2008; Ma et al., 2009; Toh and Yun, 2009).

In an alternative line of work, Keshavan, Montanari, and Oh (2010) attacked the same problem
using a combination of spectral techniques and manifold optimization: we will refer to their
algorithm as OPTSPACE. OPTSPACE is intrinsically of low complexity, the most complex operation
being computing $r$ singular values (and the corresponding singular vectors) of a sparse
$m \times n$ matrix. The performance guarantees proved by Keshavan et al. (2010) are comparable with
the information theoretic lower bound: roughly $nr\max\{r, \log n\}$ random entries are needed to
reconstruct $M$ exactly (here we assume $m$ of order $n$). A related approach was also developed by
Lee and Bresler (2009), although without performance guarantees for matrix completion.

The above results crucially rely on the assumption that $M$ is exactly a rank-$r$ matrix. For many
applications of interest, this assumption is unrealistic and it is therefore important to
investigate their robustness. Can the above approaches be generalized when the underlying data is
‘well approximated’ by a rank-$r$ matrix? This question was addressed by Candès and Plan (2009)
within the convex relaxation approach of Candès and Recht (2008). The present paper proves a
similar robustness result for OPTSPACE. Remarkably, the guarantees we obtain are order-optimal in a
variety of circumstances, and improve over the analogous results of Candès and Plan (2009).


1.1 Model Definition 
Let $M$ be an $m \times n$ matrix of rank $r$, that is

$$M = U \Sigma V^T, \tag{1}$$

where $U$ has dimensions $m \times r$, $V$ has dimensions $n \times r$, and $\Sigma$ is a diagonal
$r \times r$ matrix. We assume that each entry of $M$ is perturbed, thus producing an
‘approximately’ low-rank matrix $N$, with

$$N_{ij} = M_{ij} + Z_{ij},$$

where the matrix $Z$ will be assumed to be ‘small’ in an appropriate sense.
Out of the $m \times n$ entries of $N$, a subset $E \subseteq [m] \times [n]$ is revealed. We let
$N^E$ be the $m \times n$ matrix that contains the revealed entries of $N$, and is filled with 0’s
in the other positions

$$N^E_{ij} = \begin{cases} N_{ij} & \text{if } (i,j) \in E, \\ 0 & \text{otherwise.} \end{cases}$$

Analogously, we let $M^E$ and $Z^E$ be the $m \times n$ matrices that contain the entries of $M$
and $Z$, respectively, in the revealed positions and are filled with 0’s in the other positions.
The set $E$ will be uniformly random given its size $|E|$.
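The observation model above can be simulated in a few lines. The following is only a minimal sketch: the function name `sample_noisy_entries`, the Gaussian choice for the factors and for the noise, and the specific sizes are illustrative assumptions, not part of the model definition.

```python
import numpy as np

def sample_noisy_entries(m, n, r, num_revealed, noise_std, rng):
    """Draw a rank-r matrix M, a noisy copy N = M + Z, and a uniformly
    random revealed set E of prescribed size (hypothetical helper)."""
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((n, r))
    M = U @ V.T                                   # rank r almost surely
    Z = noise_std * rng.standard_normal((m, n))   # assumed Gaussian noise
    N = M + Z
    # E: uniformly random subset of [m] x [n] with |E| = num_revealed.
    flat = rng.choice(m * n, size=num_revealed, replace=False)
    mask = np.zeros(m * n, dtype=bool)
    mask[flat] = True
    mask = mask.reshape(m, n)
    N_E = np.where(mask, N, 0.0)                  # revealed entries, zeros elsewhere
    return M, N, mask, N_E

rng = np.random.default_rng(0)
M, N, mask, N_E = sample_noisy_entries(60, 40, 3, 600, 0.1, rng)
```

Here `mask` plays the role of the set $E$ and `N_E` that of the matrix $N^E$.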


1.2 Algorithm 


For the reader’s convenience, we recall the algorithm introduced by Keshavan et al. (2010), which
we will analyze here. The basic idea is to minimize the cost function $F(X,Y)$, defined by

$$F(X,Y) \equiv \min_{S \in \mathbb{R}^{r \times r}} \mathcal{F}(X,Y,S), \tag{2}$$

$$\mathcal{F}(X,Y,S) \equiv \frac{1}{2} \sum_{(i,j) \in E} \big(N_{ij} - (XSY^T)_{ij}\big)^2.$$

Here $X \in \mathbb{R}^{m \times r}$, $Y \in \mathbb{R}^{n \times r}$ are orthogonal matrices,
normalized by $X^TX = mI$, $Y^TY = nI$.
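For fixed $X$ and $Y$, the inner minimization over $S$ is an ordinary least-squares problem in the $r^2$ entries of $S$, since $(XSY^T)_{ij}$ is linear in $S$. A minimal sketch of how $F(X,Y)$ could be evaluated this way is given below; the helper name `cost_F` and the dense boolean `mask` representation of $E$ are assumptions made for illustration.

```python
import numpy as np

def cost_F(X, Y, N_E, mask):
    """Evaluate F(X, Y) = min_S 0.5 * sum_{(i,j) in E} (N_ij - (X S Y^T)_ij)^2
    by solving a least-squares problem in the entries of S (hypothetical helper)."""
    r = X.shape[1]
    rows, cols = np.nonzero(mask)
    # Row k of A holds the coefficients X[i_k, a] * Y[j_k, b] of S[a, b].
    A = (X[rows][:, :, None] * Y[cols][:, None, :]).reshape(len(rows), r * r)
    b = N_E[rows, cols]
    s, *_ = np.linalg.lstsq(A, b, rcond=None)
    resid = b - A @ s
    return 0.5 * float(resid @ resid), s.reshape(r, r)

rng = np.random.default_rng(1)
m, n, r = 30, 20, 2
X = np.sqrt(m) * np.linalg.qr(rng.standard_normal((m, r)))[0]   # X^T X = m I
Y = np.sqrt(n) * np.linalg.qr(rng.standard_normal((n, r)))[0]   # Y^T Y = n I
S_true = rng.standard_normal((r, r))
mask = rng.random((m, n)) < 0.5
N_E = np.where(mask, X @ S_true @ Y.T, 0.0)    # noiseless observations
F_val, S_hat = cost_F(X, Y, N_E, mask)
```

In this noiseless toy instance the minimizing $S$ coincides with the matrix used to generate the data and the cost vanishes up to rounding.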

Minimizing $F(X,Y)$ is an a priori difficult task, since $F$ is a non-convex function. The key
insight is that the singular value decomposition (SVD) of $N^E$ provides an excellent initial guess,
and that the minimum can be found with high probability by standard gradient descent after this
initialization. Two caveats must be added to this description: (1) In general the matrix $N^E$ must
be ‘trimmed’ to eliminate over-represented rows and columns; (2) For technical reasons, we consider
a slightly modified cost function to be denoted by $\tilde{F}(X,Y)$.

OPTSPACE( matrix $N^E$ )

1: Trim $N^E$, and let $\tilde{N}^E$ be the output;

2: Compute the rank-$r$ projection of $\tilde{N}^E$, $P_r(\tilde{N}^E) = X_0 S_0 Y_0^T$;

3: Minimize $\tilde{F}(X,Y)$ through gradient descent, with initial condition $(X_0, Y_0)$.








We may note here that the rank of the matrix $M$, if not known, can be reliably estimated from
$\tilde{N}^E$ (Keshavan and Oh, 2009).

The various steps of the above algorithm are defined as follows.

Trimming. We say that a row is ‘over-represented’ if it contains more than $2|E|/m$ revealed
entries (i.e., more than twice the average number of revealed entries per row). Analogously, a
column is over-represented if it contains more than $2|E|/n$ revealed entries. The trimmed matrix
$\tilde{N}^E$ is obtained from $N^E$ by setting to 0 over-represented rows and columns.
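The trimming rule translates directly into array operations. The sketch below assumes that $E$ is stored as a dense boolean `mask` and that row and column counts are both taken on the untrimmed matrix; the function name `trim` is illustrative.

```python
import numpy as np

def trim(N_E, mask):
    """Zero out rows with more than 2|E|/m revealed entries and columns with
    more than 2|E|/n revealed entries (counts taken before trimming)."""
    m, n = mask.shape
    num_revealed = mask.sum()
    row_deg = mask.sum(axis=1)
    col_deg = mask.sum(axis=0)
    keep = np.outer(row_deg <= 2 * num_revealed / m,
                    col_deg <= 2 * num_revealed / n)
    return np.where(keep, N_E, 0.0), mask & keep

# Toy example: row 0 is fully revealed, the other rows have one entry each.
mask = np.zeros((4, 4), dtype=bool)
mask[0, :] = True
mask[1, 1] = mask[2, 2] = mask[3, 3] = True
N_E = np.where(mask, 1.0, 0.0)
N_trim, mask_trim = trim(N_E, mask)
```

With $|E| = 7$ and $m = 4$ the row threshold is $3.5$, so only row 0 (four revealed entries) is set to zero.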

Rank-$r$ projection. Let

$$\tilde{N}^E = \sum_{i=1}^{\min(m,n)} \sigma_i x_i y_i^T,$$

be the singular value decomposition of $\tilde{N}^E$, with singular values
$\sigma_1 \ge \sigma_2 \ge \dots$. We then define

$$P_r(\tilde{N}^E) = \frac{mn}{|E|} \sum_{i=1}^{r} \sigma_i x_i y_i^T.$$

Apart from an overall normalization, $P_r(\tilde{N}^E)$ is the best rank-$r$ approximation to
$\tilde{N}^E$ in Frobenius norm.
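A minimal sketch of the rank-$r$ projection follows, returning the factors $(X_0, S_0, Y_0)$ with the normalization $X_0^TX_0 = mI$, $Y_0^TY_0 = nI$. It uses a dense SVD for simplicity, which is an assumption of this example; for large sparse inputs a sparse truncated SVD routine would be the natural substitute. The name `rank_r_projection` is illustrative.

```python
import numpy as np

def rank_r_projection(N_trim, r, num_revealed):
    """Return (X0, S0, Y0) with X0 S0 Y0^T = (mn/|E|) * sum_{i<=r} sigma_i x_i y_i^T."""
    m, n = N_trim.shape
    U, s, Vt = np.linalg.svd(N_trim, full_matrices=False)
    X0 = np.sqrt(m) * U[:, :r]                        # X0^T X0 = m I
    Y0 = np.sqrt(n) * Vt[:r].T                        # Y0^T Y0 = n I
    S0 = np.diag(s[:r]) * np.sqrt(m * n) / num_revealed
    return X0, S0, Y0

rng = np.random.default_rng(0)
m, n, r = 12, 9, 2
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
# Sanity check: with every entry revealed (|E| = mn) the projection returns M.
X0, S0, Y0 = rank_r_projection(M, r, num_revealed=m * n)
```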
Minimization. The modified cost function $\tilde{F}$ is defined as

$$\tilde{F}(X,Y) \equiv F(X,Y) + \rho\, G(X,Y)
\equiv F(X,Y) + \rho \sum_{i=1}^{m} G_1\!\left(\frac{\|X^{(i)}\|^2}{3\mu_0 r}\right)
+ \rho \sum_{j=1}^{n} G_1\!\left(\frac{\|Y^{(j)}\|^2}{3\mu_0 r}\right),$$

where $X^{(i)}$ denotes the $i$-th row of $X$, and $Y^{(j)}$ the $j$-th row of $Y$. The function
$G_1 : \mathbb{R}^+ \to \mathbb{R}$ is such that $G_1(z) = 0$ if $z \le 1$ and
$G_1(z) = e^{(z-1)^2} - 1$ otherwise. Further, we can choose $\rho = O(|E|)$.

Let us stress that the regularization term is mainly introduced for our proof technique to work
(and a broad family of functions $G_1$ would work as well). In numerical experiments we did not find
any performance loss in setting $\rho = 0$.
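The regularizer penalizes rows of $X$ and $Y$ whose squared norm exceeds $3\mu_0 r$ and is identically zero otherwise. A small sketch of $G_1$ and $G$ follows; the function names are illustrative, and `mu0` stands for the incoherence parameter $\mu_0$, which is assumed to be supplied by the user.

```python
import math
import numpy as np

def G1(z):
    """G1(z) = 0 for z <= 1 and exp((z - 1)^2) - 1 otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= 1.0, 0.0, np.expm1((np.maximum(z, 1.0) - 1.0) ** 2))

def regularizer_G(X, Y, mu0):
    """G(X, Y) = sum_i G1(||X^(i)||^2 / (3 mu0 r)) + sum_j G1(||Y^(j)||^2 / (3 mu0 r))."""
    r = X.shape[1]
    zx = np.sum(X ** 2, axis=1) / (3.0 * mu0 * r)
    zy = np.sum(Y ** 2, axis=1) / (3.0 * mu0 * r)
    return float(G1(zx).sum() + G1(zy).sum())

inactive = regularizer_G(np.ones((4, 2)), np.ones((3, 2)), mu0=1.0)
# One row with squared norm 18 and r = 2 gives z = 3, hence G1 = exp(4) - 1.
active = regularizer_G(3.0 * np.ones((1, 2)), np.ones((1, 2)), mu0=1.0)
```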

One important feature of OPTSPACE is that $F(X,Y)$ and $\tilde{F}(X,Y)$ are regarded as functions
of the $r$-dimensional subspaces of $\mathbb{R}^m$ and $\mathbb{R}^n$ generated (respectively) by
the columns of $X$ and $Y$. This interpretation is justified by the fact that
$F(X,Y) = F(XA, YB)$ for any two orthogonal matrices $A, B \in \mathbb{R}^{r \times r}$ (the same
property holds for $\tilde{F}$). The set of $r$-dimensional subspaces of $\mathbb{R}^m$ is a
differentiable Riemannian manifold $\mathsf{G}(m,r)$ (the Grassmann manifold). The gradient descent
algorithm is applied to the function
$\tilde{F} : \mathsf{M}(m,n) \equiv \mathsf{G}(m,r) \times \mathsf{G}(n,r) \to \mathbb{R}$. For
further details on optimization by gradient descent on matrix manifolds we refer to Edelman et al.
(1999) and Absil et al. (2008).
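To make the third step of the algorithm concrete, the sketch below runs a simplified descent on the pair of subspaces. It is not the procedure analyzed in this paper: as labeled assumptions, it sets $\rho = 0$, skips trimming, replaces geodesic line minimization with a QR-based retraction, and uses plain backtracking on the step size. All function names are hypothetical.

```python
import numpy as np

def solve_S(X, Y, N_E, mask):
    """Least-squares S minimizing 0.5 * sum_E (N_ij - (X S Y^T)_ij)^2."""
    r = X.shape[1]
    rows, cols = np.nonzero(mask)
    A = (X[rows][:, :, None] * Y[cols][:, None, :]).reshape(len(rows), r * r)
    s, *_ = np.linalg.lstsq(A, N_E[rows, cols], rcond=None)
    return s.reshape(r, r)

def cost(X, Y, N_E, mask):
    S = solve_S(X, Y, N_E, mask)
    R = np.where(mask, N_E - X @ S @ Y.T, 0.0)    # residual on revealed entries
    return 0.5 * float(np.sum(R ** 2)), R, S

def retract(A, scale):
    """Map a perturbed factor back to orthogonal columns of norm `scale`."""
    Q, _ = np.linalg.qr(A)
    return scale * Q

def descent_step(X, Y, N_E, mask):
    m, n = mask.shape
    f0, R, S = cost(X, Y, N_E, mask)
    gX = -R @ Y @ S.T                  # derivative in X at the optimal S
    gY = -R.T @ X @ S
    gX -= X @ (X.T @ gX) / m           # drop the component inside span(X)
    gY -= Y @ (Y.T @ gY) / n
    t = 1.0
    for _ in range(40):                # backtracking: accept first decreasing step
        Xn = retract(X - t * gX, np.sqrt(m))
        Yn = retract(Y - t * gY, np.sqrt(n))
        f1 = cost(Xn, Yn, N_E, mask)[0]
        if f1 < f0:
            return Xn, Yn, f1
        t *= 0.5
    return X, Y, f0                    # no decreasing step found: stay put

rng = np.random.default_rng(0)
m, n, r = 60, 50, 2
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
mask = rng.random((m, n)) < 0.4
N_E = np.where(mask, M + 0.01 * rng.standard_normal((m, n)), 0.0)

U, s, Vt = np.linalg.svd(N_E, full_matrices=False)
X, Y = np.sqrt(m) * U[:, :r], np.sqrt(n) * Vt[:r].T    # spectral initialization
costs = [cost(X, Y, N_E, mask)[0]]
for _ in range(25):
    X, Y, f = descent_step(X, Y, N_E, mask)
    costs.append(f)
M_hat = X @ solve_S(X, Y, N_E, mask) @ Y.T
```

Because each step is accepted only if it lowers the cost, the sequence `costs` is non-increasing by construction; no claim is made here about the rate of convergence of this simplified variant.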

1.3 Some Notations


The matrix $M$ to be reconstructed takes the form (1) where $U \in \mathbb{R}^{m \times r}$,
$V \in \mathbb{R}^{n \times r}$. We write $U = [u_1, u_2, \dots, u_r]$ and
$V = [v_1, v_2, \dots, v_r]$ for the columns of the two factors, with $\|u_i\| = \sqrt{m}$,
$\|v_i\| = \sqrt{n}$, and $u_i^T u_j = 0$, $v_i^T v_j = 0$ for $i \neq j$ (there is no loss of
generality in this, since normalizations can be absorbed by redefining $\Sigma$).

We shall write $\Sigma = \mathrm{diag}(\Sigma_1, \dots, \Sigma_r)$ with
$\Sigma_1 \ge \Sigma_2 \ge \dots \ge \Sigma_r > 0$. The maximum and minimum singular values will
also be denoted by $\Sigma_{\max} = \Sigma_1$ and $\Sigma_{\min} = \Sigma_r$. Further, the maximum
size of an entry of $M$ is $M_{\max} \equiv \max_{ij} |M_{ij}|$.

Probability is taken with respect to the uniformly random subset $E \subseteq [m] \times [n]$ given
$|E|$ and (possibly) the noise matrix $Z$. Define $\epsilon \equiv |E|/\sqrt{mn}$. In the case when
$m = n$, $\epsilon$ corresponds to the average number of revealed entries per row or column. It is
convenient to work with a model in which each entry is revealed independently with probability
$\epsilon/\sqrt{mn}$. Since, with high probability,
$|E| \in [\epsilon\sqrt{\alpha}\,n - A\sqrt{n\log n},\ \epsilon\sqrt{\alpha}\,n + A\sqrt{n\log n}]$,
any guarantee on the algorithm's performance that holds within one model holds within the other
model as well if we allow for a vanishing shift in $\epsilon$. We will use $C$, $C'$ etc. to denote
universal numerical constants.

It is convenient to define the following projection operator $P_E(\cdot)$ as the sampling operator,
which maps an $m \times n$ matrix onto an $|E|$-dimensional subspace in $\mathbb{R}^{m \times n}$

$$P_E(N)_{ij} = \begin{cases} N_{ij} & \text{if } (i,j) \in E, \\ 0 & \text{otherwise.} \end{cases}$$

Given a vector $x \in \mathbb{R}^n$, $\|x\|$ will denote its Euclidean norm. For a matrix
$X \in \mathbb{R}^{m \times n}$, $\|X\|_F$ is its Frobenius norm, and $\|X\|_2$ its operator norm
(i.e., $\|X\|_2 = \sup_{u \neq 0} \|Xu\|/\|u\|$). The standard scalar product between vectors or
matrices will sometimes be indicated by $\langle x, y \rangle$ or
$\langle X, Y \rangle \equiv \mathrm{Tr}(X^TY)$, respectively. Finally, we use the standard
combinatorics notation $[n] = \{1, 2, \dots, n\}$ to denote the set of the first $n$ integers.


1.4 Main Results 


Our main result is a performance guarantee for OPTSPACE under appropriate incoherence assumptions,
and is presented in Section 1.4.2. Before presenting it, we state a theorem of independent
interest that provides an error bound on the simple trimming-plus-SVD approach. The reader
interested in the OPTSPACE guarantee can go directly to Section 1.4.2.

Throughout this paper, without loss of generality, we assume $\alpha \equiv m/n \ge 1$.


1.4.1 SIMPLE SVD 


Our first result shows that, in great generality, the rank-$r$ projection of $\tilde{N}^E$ provides
a reasonable approximation of $M$. We define $\tilde{Z}^E$ to be an $m \times n$ matrix obtained
from $Z^E$, after the trimming step of the pseudocode above, that is, by setting to zero the
over-represented rows and columns.


Theorem 1.1 Let $N = M + Z$, where $M$ has rank $r$, and assume that the subset of revealed entries
$E \subseteq [m] \times [n]$ is uniformly random with size $|E|$. Let
$M_{\max} = \max_{(i,j) \in [m] \times [n]} |M_{ij}|$. Then there exist numerical constants $C$ and
$C'$ such that

$$\frac{1}{\sqrt{mn}} \|M - P_r(\tilde{N}^E)\|_F \le C M_{\max}
\left(\frac{n r \alpha^{3/2}}{|E|}\right)^{1/2}
+ C' \frac{n\sqrt{r\alpha}}{|E|} \|\tilde{Z}^E\|_2,$$







MATRIX COMPLETION FROM NOISY ENTRIES 


with probability larger than $1 - 1/n^3$.


Projection onto rank-$r$ matrices through SVD is a pretty standard tool, and is used as a first
analysis method for many practical problems. At a high level, projection onto rank-$r$ matrices can
be interpreted as ‘treat missing entries as zeros’. This theorem shows that this approach is
reasonably robust if the number of observed entries is as large as the number of degrees of freedom
(which is about $(m+n)r$) times a large constant. The error bound is the sum of two contributions:
the first one can be interpreted as an undersampling effect (error induced by missing entries) and
the second as a noise effect. Let us stress that trimming is crucial for achieving this guarantee.


1.4.2 OPTSPACE 


Theorem 1.1 helps to set the stage for the key point of this paper: a much better approximation
is obtained by minimizing the cost $\tilde{F}(X,Y)$ (step 3 in the pseudocode above), provided $M$
satisfies an appropriate incoherence condition. Let $M = U\Sigma V^T$ be a low rank matrix, and
assume, without loss of generality, $U^TU = mI$ and $V^TV = nI$. We say that $M$ is
$(\mu_0, \mu_1)$-incoherent if the following conditions hold.


A1. For all $i \in [m]$, $j \in [n]$ we have $\sum_{k=1}^{r} U_{ik}^2 \le \mu_0 r$,
$\sum_{k=1}^{r} V_{jk}^2 \le \mu_0 r$.

A2. For all $i \in [m]$, $j \in [n]$ we have
$\big|\sum_{k=1}^{r} U_{ik} (\Sigma_k/\Sigma_1) V_{jk}\big| \le \mu_1 r^{1/2}$.


Theorem 1.2 Let $N = M + Z$, where $M$ is a $(\mu_0, \mu_1)$-incoherent matrix of rank $r$, and
assume that the subset of revealed entries $E \subseteq [m] \times [n]$ is uniformly random with
size $|E|$. Further, let $\Sigma_{\min} = \Sigma_1 \le \dots \le \Sigma_r = \Sigma_{\max}$ with
$\Sigma_{\max}/\Sigma_{\min} \equiv \kappa$. Let $\hat{M}$ be the output of OPTSPACE on input
$N^E$. Then there exist numerical constants $C$ and $C'$ such that if

$$|E| \ge C n \sqrt{\alpha}\, \kappa^2 \max\left\{ \mu_0 r \sqrt{\alpha} \log n \,;\;
\mu_0^2 r^2 \alpha \kappa^4 \,;\; \mu_1^2 r^2 \alpha \kappa^4 \right\},$$

then, with probability at least $1 - 1/n^3$,

$$\frac{1}{\sqrt{mn}} \|\hat{M} - M\|_F \le C' \kappa^2 \frac{n\sqrt{\alpha r}}{|E|} \|Z^E\|_2, \tag{3}$$

provided that the right-hand side is smaller than $\Sigma_{\min}$.


As discussed in the next section, this theorem captures rather sharply the effect of important 
classes of noise on the performance of OPTSPACE. 


1.5 Noise Models 


In order to make sense of the above results, it is convenient to consider a couple of simple models 
for the noise matrix Z: 


Independent entries model. We assume that $Z$’s entries are i.i.d. random variables, with zero
mean $\mathbb{E}\{Z_{ij}\} = 0$ and sub-Gaussian tails. The latter means that

$$\mathbb{P}\{|Z_{ij}| \ge x\} \le 2 e^{-\frac{x^2}{2\sigma^2}},$$

for some constant $\sigma^2$ uniformly bounded in $n$.




KESHAVAN, MONTANARI AND OH 


Worst case model. In this model $Z$ is arbitrary, but we have a uniform bound on the size of its
entries: $|Z_{ij}| \le Z_{\max}$.


The basic parameter entering our main results is the operator norm of $\tilde{Z}^E$, which is
bounded as follows in these two noise models.


Theorem 1.3 If $Z$ is a random matrix drawn according to the independent entries model, then for
any sample size $|E|$ there is a constant $C$ such that,

$$\|\tilde{Z}^E\|_2 \le C \sigma \left(\frac{|E| \log n}{n}\right)^{1/2}, \tag{4}$$

with probability at least $1 - 1/n^3$. Further there exists a constant $C'$ such that, if the
sample size is $|E| \ge n \log n$ (for $n \ge \alpha$), we have

$$\|\tilde{Z}^E\|_2 \le C' \sigma \left(\frac{|E|}{n}\right)^{1/2}, \tag{5}$$

with probability at least $1 - 1/n^3$.
If $Z$ is a matrix from the worst case model, then

$$\|\tilde{Z}^E\|_2 \le \frac{2|E|}{n\sqrt{\alpha}} Z_{\max},$$

for any realization of $E$.


It is elementary to show that, if $|E| \ge 15 \alpha n \log n$, no row or column is
over-represented with high probability. It follows that in the regime of $|E|$ for which the
conditions of Theorem 1.2 are satisfied, we have $\tilde{Z}^E = Z^E$ and hence the bound (5)
applies to $\|Z^E\|_2$ as well. Then, among other things, this result implies that for the
independent entries model the right-hand side of our error estimate, Eq. (3), is with high
probability smaller than $\Sigma_{\min}$, if
$|E| \ge C r \alpha n \kappa^4 (\sigma/\Sigma_{\min})^2$. For the worst case model, the same
statement is true if $Z_{\max} \le \Sigma_{\min}/(C \sqrt{r} \kappa^2)$.
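For completeness, both sufficient conditions can be traced by substituting the bounds of Theorem 1.3 into the right-hand side of Eq. (3). In the worked derivation below, $C'$ is the constant of Theorem 1.2 and $C''$ is a label introduced here for the constant in Eq. (5); the final constants are not optimized.

```latex
% Independent entries model: insert Eq. (5) into the right-hand side of Eq. (3).
\begin{align*}
C'\kappa^2 \frac{n\sqrt{\alpha r}}{|E|}\,\|Z^E\|_2
  &\le C'\kappa^2 \frac{n\sqrt{\alpha r}}{|E|}\, C''\sigma \Big(\frac{|E|}{n}\Big)^{1/2}
   = C'C''\,\kappa^2 \sigma \Big(\frac{\alpha r n}{|E|}\Big)^{1/2}, \\
C'C''\,\kappa^2 \sigma \Big(\frac{\alpha r n}{|E|}\Big)^{1/2} \le \Sigma_{\min}
  &\iff |E| \ge (C'C'')^2\, r \alpha n \kappa^4 \Big(\frac{\sigma}{\Sigma_{\min}}\Big)^2 .
\end{align*}

% Worst case model: insert the deterministic bound of Theorem 1.3.
\begin{align*}
C'\kappa^2 \frac{n\sqrt{\alpha r}}{|E|}\cdot\frac{2|E|}{n\sqrt{\alpha}}\, Z_{\max}
  = 2C'\kappa^2 \sqrt{r}\, Z_{\max} \le \Sigma_{\min}
  \iff Z_{\max} \le \frac{\Sigma_{\min}}{2C'\sqrt{r}\,\kappa^2} .
\end{align*}
```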


1.6 Comparison with Other Approaches to Matrix Completion 


Let us begin by mentioning that a statement analogous to our preliminary Theorem 1.1 was proved
by Achlioptas and McSherry (2007). Our result however applies to any number of revealed entries,
while the one of Achlioptas and McSherry (2007) requires $|E| \ge (8 \log n)^4 n$ (which for
$n \le 5 \cdot 10^8$ is larger than $n^2$). We refer to Section 1.8 for further discussion of this
point.

As for Theorem 1.2, we will mainly compare our algorithm with the convex relaxation approach
recently analyzed by Candès and Plan (2009), and based on semidefinite programming. Our basic
setting is indeed the same, while the algorithms are rather different.

Figures 1 and 2 compare the average root mean square error $\|\hat{M} - M\|_F/\sqrt{mn}$ for the
two algorithms as a function of $|E|$ and the rank $r$, respectively. Here $M$ is a random rank-$r$
matrix of dimension $m = n = 600$, generated by letting $M = \tilde{U}\tilde{V}^T$ with
$\tilde{U}_{ij}, \tilde{V}_{ij}$ i.i.d. $N(0, 20/\sqrt{n})$. The noise is distributed according to
the independent entries model with $Z_{ij} \sim N(0,1)$. In the first suite of simulations,
presented in Figure 1, the rank is fixed to $r = 2$. In the second one (Figure 2), the number of
samples is fixed to $|E| = 72000$. These examples are taken from Candès and Plan (2009, Figure 2),
from which we took the data points for the convex relaxation approach, as well as the information
theoretic lower bound described later in this section. After a few iterations, OPTSPACE has a
smaller root mean square error than the one produced by convex relaxation. In about 10 iterations
it becomes indistinguishable from the information theoretic lower bound for small ranks.

[Figure 1 plot omitted. Legend: Convex Relaxation; Lower Bound; rank-r projection; OptSpace after
1, 3 and 10 iterations. Horizontal axis: |E|/n.]

Figure 1: Numerical simulation with random rank-2 600 x 600 matrices. Root mean square error
achieved by OPTSPACE is shown as a function of the number of observed entries |E| and
of the number of line minimizations. The performance of nuclear norm minimization and
an information theoretic lower bound are also shown.

Figure 2: Numerical simulation with random rank-r 600 x 600 matrices and number of observed
entries |E|/n = 120. Root mean square error achieved by OPTSPACE is shown as a
function of the rank and of the number of line minimizations. The performance of nuclear
norm minimization and an information theoretic lower bound are also shown.

Figure 3: Numerical simulation with random rank-2 600 x 600 matrices and number of observed
entries |E|/n = 80 and 160. The standard deviation of the i.i.d. Gaussian noise is 0.001.
Fit error and root mean square error achieved by OPTSPACE are shown as functions of
the number of line minimizations. Information theoretic lower bounds are also shown.

In Figure 3, we illustrate the rate of convergence of OPTSPACE. Two metrics, root mean squared
error (RMSE) and fit error $\|P_E(\hat{M} - N)\|_F/\sqrt{|E|}$, are shown as functions of the
number of iterations in the manifold optimization step. Note that the fit error can be easily
evaluated since $N^E = P_E(N)$ is always available to the estimator. $M$ is a random
$600 \times 600$ rank-2 matrix generated as in the previous examples. The additive noise is
distributed as $Z_{ij} \sim N(0, \sigma^2)$ with $\sigma = 0.001$ (a small noise level was used in
order to trace the RMSE evolution over many iterations). Each point in the figure is the average
over 20 random instances, and the resulting errors for two different values of the sample size,
$|E|/n = 80$ and $|E|/n = 160$, are shown. In both cases, we can see that the RMSE converges to
the information theoretic lower bound described later in this section. The fit error decays
exponentially with the number of iterations and converges to the standard deviation of the noise,
which is 0.001. This is a lower bound on the fit error when $r \ll n$, since even if we have a
perfect reconstruction of $M$, the average fit error is still 0.001.
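The two metrics can be computed as follows. This is a minimal sketch with illustrative function names and a smaller problem size than in Figure 3; it also checks the remark that a perfect reconstruction still has fit error close to the noise standard deviation.

```python
import numpy as np

def rmse(M_hat, M):
    """Root mean square error ||M_hat - M||_F / sqrt(mn)."""
    return np.linalg.norm(M_hat - M) / np.sqrt(M.size)

def fit_error(M_hat, N_E, mask):
    """Fit error ||P_E(M_hat - N)||_F / sqrt(|E|), using only revealed entries."""
    return np.linalg.norm((M_hat - N_E)[mask]) / np.sqrt(mask.sum())

rng = np.random.default_rng(0)
n, r, sigma = 200, 2, 0.001
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < 0.25
N_E = np.where(mask, M + sigma * rng.standard_normal((n, n)), 0.0)
perfect_fit = fit_error(M, N_E, mask)   # fit error of a perfect reconstruction
```

Only `N_E` and `mask` are needed for the fit error, which is why it can be monitored during the iterations, whereas the RMSE requires the unknown $M$.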

For a more complete numerical comparison between various algorithms for matrix completion, 
including different noise models, real data sets and ill conditioned matrices, we refer to Keshavan 
and Oh (2009). 

Next, let us compare our main result with the performance guarantee of Candès and Plan (2009,
Theorem 7). Let us stress that we require the condition number $\kappa$ to be bounded, while the
analysis of Candès and Plan (2009) and Candès and Tao (2009) requires a stronger incoherence
assumption (compared to our A1). Therefore the assumptions are not directly comparable. As far as
the error bound is concerned, Candès and Plan (2009) proved that the semidefinite programming
approach returns an estimate $\hat{M}$ which satisfies

$$\frac{1}{\sqrt{mn}} \|\hat{M}_{\rm SDP} - M\|_F \le 7 \sqrt{\frac{n}{|E|}}\, \|Z^E\|_F
+ \frac{2}{n\sqrt{\alpha}} \|Z^E\|_F. \tag{6}$$

(The constant in front of the first term is in fact slightly smaller than 7 in Candès and Plan
(2009), but in any case larger than $4\sqrt{2}$. We choose to quote a result which is slightly less
accurate but easier to parse.)

Theorem 1.2 improves over this result in several respects: (1) We do not have the second term on
the right-hand side of (6), that actually increases with the number of observed entries; (2) Our
error decreases as $n/|E|$ rather than $(n/|E|)^{1/2}$; (3) The noise enters Theorem 1.2 through
the operator norm $\|Z^E\|_2$ instead of its Frobenius norm $\|Z^E\|_F \ge \|Z^E\|_2$. For $E$
uniformly random, one expects $\|Z^E\|_F$ to be roughly of order $\|Z^E\|_2 \sqrt{n}$. For
instance, within the independent entries model with bounded variance $\sigma$,
$\|Z^E\|_F = \Theta(\sqrt{|E|})$ while $\|Z^E\|_2$ is of order $\sqrt{|E|/n}$ (up to logarithmic
terms).

Theorem 1.2 can also be compared to an information theoretic lower bound computed by Candès
and Plan (2009). Suppose, for simplicity, $m = n$ and assume that an oracle provides us a linear
subspace $T$ where the correct rank-$r$ matrix $M = U\Sigma V^T$ lies. More precisely, we know that
$M \in T$ where $T$ is a linear space of dimension $2nr - r^2$ defined by

$$T = \{ UY^T + XV^T \mid X \in \mathbb{R}^{n \times r},\ Y \in \mathbb{R}^{n \times r} \}.$$

Notice that the rank constraint is therefore replaced by this simple linear constraint. The minimum
mean square error estimator is computed by projecting the revealed entries onto the subspace $T$,
which can be done by solving a least squares problem. Candès and Plan (2009) analyzed the root
mean squared error of the resulting estimator $\hat{M}$ and showed that

$$\frac{1}{\sqrt{mn}} \|\hat{M}_{\rm Oracle} - M\|_F \approx \sqrt{\frac{1}{|E|}}\, \|Z^E\|_F.$$

Here $\approx$ indicates that the root mean squared error concentrates in probability around the
right-hand side.

For the sake of comparison, suppose we have i.i.d. Gaussian noise with variance $\sigma^2$. In this
case the oracle estimator yields (for $r = o(n)$)

$$\frac{1}{\sqrt{mn}} \|\hat{M}_{\rm Oracle} - M\|_F \approx \sigma \sqrt{\frac{2nr}{|E|}}.$$

The bound (6) on the semidefinite programming approach yields

$$\frac{1}{\sqrt{mn}} \|\hat{M}_{\rm SDP} - M\|_F \le \sigma \left( 7\sqrt{n}
+ \frac{2}{n}\sqrt{|E|} \right).$$

Finally, using Theorems 1.2 and 1.3 we deduce that OPTSPACE achieves

$$\frac{1}{\sqrt{mn}} \|\hat{M}_{\rm OptSpace} - M\|_F \le \sigma \sqrt{\frac{C n r}{|E|}}.$$

Hence, when the noise is i.i.d. Gaussian with small enough $\sigma$, OPTSPACE is order-optimal.
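To get a feeling for the gap between these expressions, one can evaluate the oracle error and the semidefinite programming bound at the parameters of Figure 2 ($n = 600$, $r = 2$, $|E| = 72000$, $\sigma = 1$). The sketch below uses hypothetical function names; the OPTSPACE bound is not evaluated because its constant $C$ is unspecified.

```python
import math

def oracle_rmse(n, r, num_revealed, sigma):
    """Oracle error sigma * sqrt(2 n r / |E|) for i.i.d. Gaussian noise, m = n."""
    return sigma * math.sqrt(2 * n * r / num_revealed)

def sdp_bound(n, num_revealed, sigma):
    """Upper bound sigma * (7 sqrt(n) + 2 sqrt(|E|) / n) derived from Eq. (6), m = n."""
    return sigma * (7 * math.sqrt(n) + 2 * math.sqrt(num_revealed) / n)

n, r, num_revealed, sigma = 600, 2, 72000, 1.0
oracle = oracle_rmse(n, r, num_revealed, sigma)
sdp = sdp_bound(n, num_revealed, sigma)
```

At these values the oracle error is about 0.18 while the guarantee (6) evaluates to about 172. The latter is an upper bound on the error, not the error actually observed in the simulations.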


1.7 Related Work on Gradient Descent


Local optimization techniques such as gradient descent or coordinate descent have been intensively
studied in machine learning, with a number of applications. Here we will briefly review the recent
literature on the use of such techniques within collaborative filtering applications.


Collaborative filtering was studied from a graphical models perspective in Salakhutdinov et al.
(2007), which introduced an approach to prediction based on Restricted Boltzmann Machines (RBM).
Exact learning of the model parameters is intractable for such models, but the authors studied the
performance of contrastive divergence, which computes an approximate gradient of the likelihood
function, and uses it to optimize the likelihood locally. Based on empirical evidence, it was
argued that RBMs have several advantages over spectral methods for collaborative filtering.


An objective function analogous to the one used in the present paper was considered early on
in Srebro and Jaakkola (2003), which uses gradient descent in the factors to minimize a weighted
sum of square residuals. Salakhutdinov and Mnih (2008) justified the use of such an objective
function by deriving it as the (negative) log-posterior of an appropriate probabilistic model. This
approach naturally led to the use of quadratic regularization in the factors. Again, gradient
descent in the factors was used to perform the optimization. Also, this paper introduced a logistic
mapping between the low-rank matrix and the recorded ratings.


Recently, this line of work was pushed further in Salakhutdinov and Srebro (2010), which
emphasizes the advantage of using a non-uniform quadratic regularization in the factors. The basic
objective function was again a sum of square residuals, and a version of stochastic gradient
descent was used to optimize it.


This rich and successful line of work emphasizes the importance of obtaining a rigorous
understanding of methods based on local minimization of the sum of square residuals with respect to
the factors. The present paper provides a first step in that direction. Hopefully the techniques
developed here will be useful to analyze the many variants of this approach.


The relationship between the non-convex objective function and convex relaxation introduced
by Fazel (2002) was further investigated by Srebro et al. (2005) and Recht et al. (2007). The basic
relation is provided by the identity

$$\|M\|_* = \frac{1}{2} \min_{M = XY^T} \left\{ \|X\|_F^2 + \|Y\|_F^2 \right\}, \tag{7}$$

where $\|M\|_*$ denotes the nuclear norm of $M$ (the sum of its singular values). In other words,
adding a regularization term that is quadratic in the factors (as the one used in much of the
literature reviewed above) is equivalent to weighting $M$ by its nuclear norm, that can be regarded
as a convex surrogate of its rank.
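Identity (7) is easy to check numerically: the balanced factorization $X = U\,\mathrm{diag}(s)^{1/2}$, $Y = V\,\mathrm{diag}(s)^{1/2}$ built from the SVD $M = U\,\mathrm{diag}(s)\,V^T$ attains the minimum, and any other factorization can only give a larger value. The sketch below illustrates this on a random instance; the particular matrices are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
nuclear = s.sum()                       # nuclear norm: sum of singular values

# Balanced factorization M = X Y^T with X = U diag(sqrt(s)), Y = V diag(sqrt(s)).
X = U * np.sqrt(s)
Y = Vt.T * np.sqrt(s)
balanced = 0.5 * (np.linalg.norm(X) ** 2 + np.linalg.norm(Y) ** 2)

# Another factorization of the same M: X A and Y A^{-T} for an invertible A.
A = rng.standard_normal((6, 6)) + 3.0 * np.eye(6)
X2 = X @ A
Y2 = Y @ np.linalg.inv(A).T
other = 0.5 * (np.linalg.norm(X2) ** 2 + np.linalg.norm(Y2) ** 2)
```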


In view of the identity (7) it might be possible to use the results in this paper to prove stronger
guarantees on the nuclear norm minimization approach. Unfortunately this implication is not
immediate. Indeed in the present paper we assume the correct rank $r$ is known, while on the other
hand we do not use a quadratic regularization in the factors. (See Keshavan and Oh, 2009 for a
procedure that estimates the rank from the data and is provably successful under the hypotheses of
Theorem 1.2.) Trying to establish such an implication, and clarifying the relation between the two
approaches is nevertheless a promising research direction.


1.8 On the Spectrum of Sparse Matrices and the Role of Trimming


The trimming step of the OPTSPACE algorithm is somewhat counter-intuitive in that we seem to be
wasting information. In this section we want to clarify its role through a simple example. Before
describing the example, let us stress once again two facts: (i) In the last step of our algorithm,
the trimmed entries are actually incorporated in the cost function and hence the full information
is exploited; (ii) Trimming is not the only way to treat over-represented rows/columns in $M^E$,
and probably not the optimal one. One might for instance rescale the entries of such rows/columns.
We stick to trimming because we can prove it actually works.

Let us now turn to the example. Assume, for the sake of simplicity, that $m = n$, there is no
noise in the revealed entries, and $M$ is the rank one matrix with $M_{ij} = 1$ for all $i$ and
$j$. Within the independent sampling model, the matrix $M^E$ has i.i.d. entries, with distribution
Bernoulli($\epsilon/n$). The number of non-zero entries in a column is
Binomial($n$, $\epsilon/n$) and is independent for different columns. It is not hard to realize
that the column with the largest number of entries has more than $C \log n/\log\log n$ entries,
with positive probability (this probability can be made as large as we want by reducing $C$). Let
$i$ be the index of this column, and consider the test vector $e^{(i)}$ that has the $i$-th entry
equal to 1 and all the others equal to 0. By computing $\|M^E e^{(i)}\|$, we conclude that the
largest singular value of $M^E$ is at least $\sqrt{C \log n/\log\log n}$. In particular, this is
very different from the largest singular value of $\mathbb{E}\{M^E\} = (\epsilon/n) M$, which is
$\epsilon$. This suggests that approximating $M$ with $P_r(M^E)$ leads to a large error. Hence
trimming is crucial in proving Theorem 1.1. Also, the phenomenon is more severe in real data sets
than in the present model, where each entry is revealed independently.
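The example can be reproduced numerically. The sketch below samples the Bernoulli matrix $M^E$ for the all-ones $M$, compares its largest singular value with the largest row or column count, and then applies trimming; the size $n = 400$ and $\epsilon = 1$ are arbitrary choices made to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 400, 1.0
mask = rng.random((n, n)) < eps / n     # M^E for the all-ones M: i.i.d. Bernoulli(eps/n)
M_E = mask.astype(float)

row_deg = mask.sum(axis=1)
col_deg = mask.sum(axis=0)
max_deg = max(row_deg.max(), col_deg.max())
top_sv = np.linalg.norm(M_E, 2)         # largest singular value of M^E

# Trimming: zero rows/columns with more than twice the average number of entries.
num_revealed = mask.sum()
keep = np.outer(row_deg <= 2 * num_revealed / n, col_deg <= 2 * num_revealed / n)
top_sv_trimmed = np.linalg.norm(np.where(keep, M_E, 0.0), 2)
```

For a 0/1 matrix, applying $M^E$ to the indicator vector of the densest column shows that `top_sv` is at least `sqrt(max_deg)`, which is the mechanism described above; zeroing rows and columns can only reduce the largest singular value.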

Trimming is also crucial in proving Theorem 1.3. Using the above argument, it is possible to
show that under the worst case model,

$$\|Z^E\|_2 \ge C'(\epsilon)\, Z_{\max} \sqrt{\frac{\log n}{\log\log n}}.$$

This suggests that the largest singular value of the noise matrix $Z^E$ is quite different from the
largest singular value of $\mathbb{E}\{Z^E\}$ which is $\epsilon Z_{\max}$.

To summarize, Theorems 1.1 and 1.3 (for the worst case model) simply do not hold without
trimming or a similar procedure to normalize rows/columns of $N^E$. Trimming allows us to overcome
the above phenomenon by setting to 0 over-represented rows/columns.


























2. Proof of Theorem 1.1 


As explained in the introduction, the crucial idea is to consider the singular value decomposition
of the trimmed matrix $\tilde{N}^E$ instead of the original matrix $N^E$. Apart from a trivial
rescaling, these singular values are close to the ones of the original matrix $M$.


Lemma 1 There exists a numerical constant $C$ such that, with probability greater than
$1 - 1/n^3$,

$$\left| \frac{\sigma_q}{\epsilon} - \Sigma_q \right| \le C M_{\max} \sqrt{\frac{\alpha}{\epsilon}}
+ \frac{1}{\epsilon} \|\tilde{Z}^E\|_2,$$

where it is understood that $\Sigma_q = 0$ for $q > r$.

Proof For any matrix $A$, let $\sigma_q(A)$ denote the $q$th singular value of $A$. Then,
$\sigma_q(A + B) \le \sigma_q(A) + \sigma_1(B)$, whence

$$\left| \frac{\sigma_q}{\epsilon} - \Sigma_q \right|
\le \left| \frac{\sigma_q(\tilde{M}^E)}{\epsilon} - \Sigma_q \right|
+ \frac{\sigma_1(\tilde{Z}^E)}{\epsilon}
\le C M_{\max} \sqrt{\frac{\alpha}{\epsilon}} + \frac{1}{\epsilon} \|\tilde{Z}^E\|_2,$$

where the second inequality follows from the next Lemma as shown by Keshavan et al. (2010).
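The perturbation inequality used in this proof, $\sigma_q(A+B) \le \sigma_q(A) + \sigma_1(B)$, implies that no singular value moves by more than $\|B\|_2$ when $B$ is added. A quick numerical check on arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 5))
B = 0.1 * rng.standard_normal((7, 5))

sA = np.linalg.svd(A, compute_uv=False)
sB = np.linalg.svd(B, compute_uv=False)
sAB = np.linalg.svd(A + B, compute_uv=False)

gap = np.max(np.abs(sAB - sA))   # largest shift of any singular value
```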


Lemma 2 (Keshavan, Montanari, Oh, 2009) There exists a numerical constant $C$ such that, with
probability larger than $1 - 1/n^3$,

$$\left\| \frac{1}{\sqrt{mn}} M - \frac{1}{\epsilon} \tilde{M}^E \right\|_2
\le C M_{\max} \sqrt{\frac{\alpha}{\epsilon}}.$$

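We will now prove Theorem 1.1.
Proof (Theorem 1.1) For any matrix $A$ of rank at most $2r$, $\|A\|_F \le \sqrt{2r}\,\|A\|_2$,
whence

$$\begin{aligned}
\frac{1}{\sqrt{mn}} \|M - P_r(\tilde{N}^E)\|_F
&\le \frac{\sqrt{2r}}{\sqrt{mn}} \Big\| M - \frac{\sqrt{mn}}{\epsilon}
\Big( \tilde{M}^E + \tilde{Z}^E - \sum_{i \ge r+1} \sigma_i x_i y_i^T \Big) \Big\|_2 \\
&= \sqrt{2r}\, \Big\| \frac{1}{\sqrt{mn}} M - \frac{1}{\epsilon}
\Big( \tilde{M}^E + \tilde{Z}^E - \sum_{i \ge r+1} \sigma_i x_i y_i^T \Big) \Big\|_2 \\
&= \sqrt{2r}\, \Big\| \Big( \frac{1}{\sqrt{mn}} M - \frac{1}{\epsilon} \tilde{M}^E \Big)
- \frac{1}{\epsilon} \Big( \tilde{Z}^E - \sum_{i \ge r+1} \sigma_i x_i y_i^T \Big) \Big\|_2 \\
&\le \sqrt{2r}\, \Big( \Big\| \frac{1}{\sqrt{mn}} M - \frac{1}{\epsilon} \tilde{M}^E \Big\|_2
+ \frac{1}{\epsilon} \|\tilde{Z}^E\|_2 + \frac{1}{\epsilon} \sigma_{r+1} \Big) \\
&\le 2 C M_{\max} \sqrt{\frac{2 \alpha r}{\epsilon}}
+ \frac{2\sqrt{2r}}{\epsilon} \|\tilde{Z}^E\|_2 \\
&\le C M_{\max} \Big( \frac{n r \alpha^{3/2}}{|E|} \Big)^{1/2}
+ 2\sqrt{2}\, \frac{n\sqrt{r\alpha}}{|E|} \|\tilde{Z}^E\|_2,
\end{aligned}$$

where on the fourth line, we have used the fact that for any matrices $A_i$,
$\|\sum_i A_i\|_2 \le \sum_i \|A_i\|_2$, on the fifth line we have used Lemma 2 together with
Lemma 1 applied to $q = r+1$ (for which $\Sigma_{r+1} = 0$), and on the last line we have used
$1/\epsilon = n\sqrt{\alpha}/|E|$ and redefined the constant $C$. This proves our claim. □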


3. Proof of Theorem 1.2 


Recall that the cost function is defined over the Riemannian manifold
$\mathsf{M}(m,n) \equiv \mathsf{G}(m,r) \times \mathsf{G}(n,r)$. The proof of Theorem 1.2 consists
in controlling the behavior of $F$ in a neighborhood of $\mathbf{u} = (U,V)$ (the point
corresponding to the matrix $M$ to be reconstructed). Throughout the proof we let
$\mathcal{K}(\mu)$ be the set of matrix couples
$(X,Y) \in \mathbb{R}^{m \times r} \times \mathbb{R}^{n \times r}$ such that
$\|X^{(i)}\|^2 \le \mu r$, $\|Y^{(j)}\|^2 \le \mu r$ for all $i, j$.


3.1 Preliminary Remarks and Definitions


Given $\mathbf{x}_1 = (X_1, Y_1)$ and $\mathbf{x}_2 = (X_2, Y_2) \in \mathsf{M}(m,n)$, two points
on this manifold, their distance is defined as
$d(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{d(X_1,X_2)^2 + d(Y_1,Y_2)^2}$, where, letting
$(\cos\theta_1, \dots, \cos\theta_r)$ be the singular values of $X_1^T X_2/m$,

$$d(X_1, X_2) = \|\theta\|_2.$$
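The distance between two subspaces defined above can be computed from the principal angles. The sketch below assumes the normalization $X^TX = mI$ used throughout; the function name `grassmann_distance` is illustrative, and the clipping guards against rounding errors that push a cosine slightly above 1.

```python
import numpy as np

def grassmann_distance(X1, X2):
    """d(X1, X2) = ||theta||_2, where cos(theta_i) are the singular values of X1^T X2 / m."""
    m = X1.shape[0]
    cosines = np.linalg.svd(X1.T @ X2 / m, compute_uv=False)
    theta = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.linalg.norm(theta))

rng = np.random.default_rng(0)
m, r = 10, 3
X = np.sqrt(m) * np.linalg.qr(rng.standard_normal((m, r)))[0]   # X^T X = m I
Q = np.linalg.qr(rng.standard_normal((r, r)))[0]                # r x r orthogonal
same_subspace = grassmann_distance(X, X @ Q)    # same column span

e1 = 2.0 * np.eye(4)[:, [0]]                    # m = 4, r = 1, squared norm 4 = m
e2 = 2.0 * np.eye(4)[:, [1]]
orthogonal = grassmann_distance(e1, e2)         # one principal angle equal to pi/2
```

The first value is zero up to rounding, reflecting that the distance depends only on the column spans, while two orthogonal lines are at distance $\pi/2$.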


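The next remark bounds the distance between two points on the manifold. In particular, we will
use this to bound the distance between the original matrix $M = U\Sigma V^T$ and the starting point
of the manifold optimization $\hat{M} = X_0 S_0 Y_0^T$.


Remark 3 (Keshavan, Montanari, Oh, 2009) Let $U, X \in \mathbb{R}^{m \times r}$ with
$U^TU = X^TX = mI$, $V, Y \in \mathbb{R}^{n \times r}$ with $V^TV = Y^TY = nI$, and
$M = U\Sigma V^T$, $\hat{M} = XSY^T$ for $\Sigma = \mathrm{diag}(\Sigma_1, \dots, \Sigma_r)$ and
$S \in \mathbb{R}^{r \times r}$. If $\Sigma_1, \dots, \Sigma_r \ge \Sigma_{\min}$, then

$$d(U,X) \le \frac{\pi}{\sqrt{2\alpha}\, n \Sigma_{\min}} \|M - \hat{M}\|_F, \qquad
d(V,Y) \le \frac{\pi}{\sqrt{2\alpha}\, n \Sigma_{\min}} \|M - \hat{M}\|_F.$$


Given $S$ achieving the minimum in Eq. (2), it is also convenient to introduce the notations

$$d_-(\mathbf{x}, \mathbf{u}) \equiv \sqrt{\Sigma_{\min}^2\, d(\mathbf{x}, \mathbf{u})^2
+ \|S - \Sigma\|_F^2},$$

$$d_+(\mathbf{x}, \mathbf{u}) \equiv \sqrt{\Sigma_{\max}^2\, d(\mathbf{x}, \mathbf{u})^2
+ \|S - \Sigma\|_F^2}.$$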


3.2 Auxiliary Lemmas and Proof of Theorem 1.2 


The proof is based on the following two lemmas that generalize and sharpen analogous bounds in 
Keshavan et al. (2010). 


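Lemma 4 There exist numerical constants $C_0, C_1, C_2$ such that the following happens. Assume
$\epsilon \ge C_0 \mu_0 r \sqrt{\alpha} \max\{ \log n \,;\, \mu_0 r \sqrt{\alpha}
(\Sigma_{\min}/\Sigma_{\max})^4 \}$ and $\delta \le \Sigma_{\min}/(C_0 \Sigma_{\max})$. Then,

$$F(\mathbf{x}) - F(\mathbf{u}) \ge C_1 n \epsilon \sqrt{\alpha}\, d_-(\mathbf{x}, \mathbf{u})^2
- C_1 n \sqrt{r\alpha}\, \|Z^E\|_2\, d_+(\mathbf{x}, \mathbf{u}), \tag{8}$$

$$F(\mathbf{x}) - F(\mathbf{u}) \le C_2 n \epsilon \sqrt{\alpha}\, \Sigma_{\max}^2\,
d(\mathbf{x}, \mathbf{u})^2 + C_2 n \sqrt{r\alpha}\, \|Z^E\|_2\, d_+(\mathbf{x}, \mathbf{u}),
\tag{9}$$

for all $\mathbf{x} \in \mathsf{M}(m,n) \cap \mathcal{K}(4\mu_0)$ such that
$d(\mathbf{x}, \mathbf{u}) \le \delta$, with probability at least $1 - 1/n^4$. Here
$S \in \mathbb{R}^{r \times r}$ is the matrix realizing the minimum in Eq. (2).


Corollary 3.1 There exists a constant $C$ such that, under the hypotheses of Lemma 4

$$\|S - \Sigma\|_F \le C \Sigma_{\max}\, d(\mathbf{x}, \mathbf{u})
+ C \frac{\sqrt{r}}{\epsilon} \|Z^E\|_2.$$

Further, for an appropriate choice of the constants in Lemma 4, we have

$$\sigma_{\max}(S) \le 2\Sigma_{\max} + C \frac{\sqrt{r}}{\epsilon} \|Z^E\|_2, \tag{10}$$

$$\sigma_{\min}(S) \ge \frac{1}{2}\Sigma_{\min} - C \frac{\sqrt{r}}{\epsilon} \|Z^E\|_2. \tag{11}$$

Lemma 5 There exist numerical constants $C_0, C_1, C_2$ such that the following happens. Assume
$\epsilon \ge C_0 \mu_0 r \sqrt{\alpha}\, (\Sigma_{\max}/\Sigma_{\min})^2 \max\{ \log n \,;\,
\mu_0 r \sqrt{\alpha} (\Sigma_{\max}/\Sigma_{\min})^4 \}$ and
$\delta \le \Sigma_{\min}/(C_0 \Sigma_{\max})$. Then,

$$\|\mathrm{grad}\, \tilde{F}(\mathbf{x})\|^2 \ge C_1 n \epsilon^2 \Sigma_{\min}^4
\left[ d(\mathbf{x}, \mathbf{u}) - C_2 \frac{\sqrt{r}\, \Sigma_{\max}}{\epsilon \Sigma_{\min}}
\frac{\|Z^E\|_2}{\Sigma_{\min}} \right]_+^2, \tag{12}$$

for all $\mathbf{x} \in \mathsf{M}(m,n) \cap \mathcal{K}(4\mu_0)$ such that
$d(\mathbf{x}, \mathbf{u}) \le \delta$, with probability at least $1 - 1/n^4$. (Here
$[a]_+ \equiv \max(a, 0)$.)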

We can now turn to the proof of our main theorem. 
Proof (Theorem 1.2). Let $\delta = \Sigma_{\min}/(C_0 \Sigma_{\max})$ with $C_0$ large enough so that the hypotheses of Lemmas 4 and 5 are verified.

Call $\{x_k\}_{k\ge 0}$ the sequence of pairs $(X_k, Y_k) \in \mathsf{M}(m,n)$ generated by gradient descent. By assumption the right-hand side of Eq. (3) is smaller than $\Sigma_{\min}$. The following is therefore true for some numerical constant $C$:

$$\|Z^E\|_2 \le \frac{\epsilon}{C\sqrt{r}} \left(\frac{\Sigma_{\min}}{\Sigma_{\max}}\right)^2 \Sigma_{\min} . \qquad (13)$$


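Notice that the constant appearing here can be made as large as we want by modifying the constant appearing in the statement of the theorem. Further, by using Corollary 3.1 in Eqs. (8) and (9) we get

$$F(x) - F(u) \ge C_1 n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2 \left\{ d(x,u)^2 - \delta_{0,-}^2 \right\}, \qquad (14)$$

$$F(x) - F(u) \le C_2 n \epsilon \sqrt{\alpha}\, \Sigma_{\max}^2 \left\{ d(x,u)^2 + \delta_{0,+}^2 \right\}, \qquad (15)$$

with $C_1$ and $C_2$ different from those in Eqs. (8) and (9), where

$$\delta_{0,-} \equiv C \frac{\sqrt{r}\, \Sigma_{\max} \|Z^E\|_2}{\epsilon\, \Sigma_{\min} \Sigma_{\min}} , \qquad \delta_{0,+} \equiv C \frac{\sqrt{r}\, \Sigma_{\max} \|Z^E\|_2}{\epsilon\, \Sigma_{\min} \Sigma_{\max}} .$$

By Eq. (13), with large enough $C$, we can assume $\delta_{0,-} \le \delta/20$ and $\delta_{0,+} \le (\delta/20)(\Sigma_{\min}/\Sigma_{\max})$.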

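Next, we provide a bound on $d(u, x_0)$. Using Remark 3, we have $d(u,x_0) \le (\pi / n\sqrt{\alpha}\, \Sigma_{\min})\, \|M - X_0 S_0 Y_0^T\|_F$. Together with Theorem 1.1 this implies

$$d(u, x_0) \le \frac{C M_{\max}}{\Sigma_{\min}} \left(\frac{r\alpha}{\epsilon}\right)^{1/2} + \frac{C' \sqrt{r}}{\epsilon\, \Sigma_{\min}}\, \|\widetilde{Z}^E\|_2 .$$

Since $\epsilon \ge C'' \alpha \mu_1^2 r^2 (\Sigma_{\max}/\Sigma_{\min})^4$ as per our assumptions and $M_{\max} \le \mu_1 \sqrt{r}\, \Sigma_{\max}$ for incoherent $M$, the first term in the above bound is upper bounded by $\Sigma_{\min}/20 C_0 \Sigma_{\max}$, for large enough $C''$. Using Eq. (13), with large enough constant $C$, the second term in the above bound is upper bounded by $\Sigma_{\min}/20 C_0 \Sigma_{\max}$. Hence we get

$$d(u, x_0) \le \frac{\delta}{10} .$$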


We make the following claims:

1. $x_k \in \mathcal{K}(4\mu_0)$ for all $k$.

First we notice that we can assume $x_0 \in \mathcal{K}(3\mu_0)$. Indeed, if this does not hold, we can 'rescale' those rows of $X_0$, $Y_0$ that violate the constraint. A proof that this rescaling is possible was given in Keshavan et al. (2010) (cf. Remark 6.2 there). We restate the result here for the reader's convenience in the next Remark.


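Remark 6 Let $U, X \in \mathbb{R}^{n\times r}$ with $U^TU = X^TX = nI$ and $U \in \mathcal{K}(\mu_0)$ and $d(X,U) \le \delta \le \frac{1}{16}$. Then there exists $X' \in \mathbb{R}^{n\times r}$ such that $X'^TX' = nI$, $X' \in \mathcal{K}(3\mu_0)$ and $d(X',U) \le 4\delta$. Further, such an $X'$ can be computed from $X$ in a time of $O(nr^2)$.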


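Since $x_0 \in \mathcal{K}(3\mu_0)$, $\widetilde{F}(x_0) = F(x_0) \le 4 C_2 n \epsilon \sqrt{\alpha}\, \Sigma_{\max}^2 \delta^2/100$. On the other hand $\widetilde{F}(x) \ge \rho(e^{1/9} - 1)$ for $x \notin \mathcal{K}(4\mu_0)$. Since $\widetilde{F}(x_k)$ is a non-increasing sequence, the thesis follows provided we take $\rho \ge C_2 n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2$.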


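2. $d(x_k,u) \le \delta/10$ for all $k$.

Since $\epsilon \ge C \alpha \mu_1^2 r^2 (\Sigma_{\max}/\Sigma_{\min})^6$ as per our assumptions in Theorem 1.2, we have $d(x_0,u)^2 \le (C_1 \Sigma_{\min}^2 / C_2 \Sigma_{\max}^2)(\delta/20)^2$. Also assuming Eq. (13) with large enough $C$, we have $\delta_{0,-} \le \delta/20$ and $\delta_{0,+} \le (\delta/20)(\Sigma_{\min}/\Sigma_{\max})$. Then, by Eq. (15),

$$F(x_0) \le F(u) + C_1 n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2\, \frac{2\delta^2}{400} .$$

Also, using Eq. (14), for all $x_k$ such that $d(x_k,u) \in [\delta/10, \delta]$, we have

$$F(x_k) \ge F(u) + C_1 n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2\, \frac{3\delta^2}{400} .$$

Hence, for all $x_k$ such that $d(x_k,u) \in [\delta/10, \delta]$, we have $\widetilde{F}(x_k) \ge F(x_k) > F(x_0)$. This contradicts the monotonicity of $\widetilde{F}(x_k)$, and thus proves the claim.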
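The constant in the lower bound on $F(x_k)$ is simple arithmetic on Eq. (14). As a sketch, using only $d(x_k,u) \ge \delta/10$ and $\delta_{0,-} \le \delta/20$:

```latex
\begin{align*}
d(x_k,u)^2 - \delta_{0,-}^2
  \;\ge\; \Big(\frac{\delta}{10}\Big)^2 - \Big(\frac{\delta}{20}\Big)^2
  \;=\; \frac{4\delta^2}{400} - \frac{\delta^2}{400}
  \;=\; \frac{3\delta^2}{400} .
\end{align*}
```

Multiplying by $C_1 n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2$ gives a gap of $3\delta^2/400$ in the same units as the $2\delta^2/400$ bound on $F(x_0) - F(u)$, which is what produces the strict inequality between $F(x_k)$ and $F(x_0)$.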


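Since the cost function is twice differentiable, and because of the above two claims, the sequence $\{x_k\}$ converges to

$$\Omega = \left\{ x \in \mathcal{K}(4\mu_0) \cap \mathsf{M}(m,n) \,:\, d(x,u) \le \delta,\ \mathrm{grad}\, \widetilde{F}(x) = 0 \right\} .$$

By Lemma 5 for any $x \in \Omega$,

$$d(x,u) \le C \frac{\sqrt{r}\, \Sigma_{\max} \|Z^E\|_2}{\epsilon\, \Sigma_{\min} \Sigma_{\min}} . \qquad (16)$$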


Using Corollary 3.1, we have $d_+(x,u) \le \Sigma_{\max}\, d(x,u) + \|S - \Sigma\|_F \le C \Sigma_{\max}\, d(x,u) + C(\sqrt{r}/\epsilon)\|Z^E\|_2$. Together with Eqs. (18) and (16), this implies

$$\frac{1}{n\sqrt{\alpha}}\, \|M - XSY^T\|_F \le C \frac{\sqrt{r}\, \Sigma_{\max}^2 \|Z^E\|_2}{\epsilon\, \Sigma_{\min}^2} ,$$

which finishes the proof of Theorem 1.2. □

3.3 Proof of Lemma 4 and Corollary 3.1


Proof (Lemma 4) The proof is based on the analogous bound in the noiseless case, that is, Lemma 
5.3 in Keshavan et al. (2010). For readers’ convenience, the result is reported in Appendix A, 
Lemma 7. For the proof of these lemmas, we refer to Keshavan et al. (2010). 

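In order to prove the lower bound, we start by noticing that

$$F(u) \le \frac{1}{2}\, \|\mathcal{P}_E(Z)\|_F^2 ,$$

which is simply proved by using $S = \Sigma$ in Eq. (2). On the other hand, we have

$$\begin{aligned}
F(x) &= \frac{1}{2}\, \|\mathcal{P}_E(XSY^T - M - Z)\|_F^2 \\
&= \frac{1}{2}\, \|\mathcal{P}_E(Z)\|_F^2 + \frac{1}{2}\, \|\mathcal{P}_E(XSY^T - M)\|_F^2 - \langle \mathcal{P}_E(Z), (XSY^T - M) \rangle \qquad (17) \\
&\ge F(u) + C n \epsilon \sqrt{\alpha}\, d_-(x,u)^2 - \sqrt{2r}\, \|Z^E\|_2\, \|XSY^T - M\|_F ,
\end{aligned}$$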


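where in the last step we used Lemma 7. Now by the triangle inequality

$$\begin{aligned}
\|XSY^T - M\|_F^2 &\le 3\|X(S-\Sigma)Y^T\|_F^2 + 3\|X\Sigma(Y-V)^T\|_F^2 + 3\|(X-U)\Sigma V^T\|_F^2 \\
&\le 3nm\, \|S - \Sigma\|_F^2 + 3n^2\alpha\, \Sigma_{\max}^2 \left( \frac{1}{m}\|X - U\|_F^2 + \frac{1}{n}\|Y - V\|_F^2 \right) \\
&\le C n^2 \alpha\, d_+(x,u)^2 . \qquad (18)
\end{aligned}$$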
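In order to prove the upper bound, we proceed as above to get

$$F(x) \le \frac{1}{2}\, \|\mathcal{P}_E(Z)\|_F^2 + C n \epsilon \sqrt{\alpha}\, \Sigma_{\max}^2\, d(x,u)^2 + \sqrt{2r\alpha}\, \|Z^E\|_2\, C n\, d_+(x,u) .$$

Further, by replacing $x$ with $u$ in Eq. (17)

$$\begin{aligned}
F(u) &\ge \frac{1}{2}\, \|\mathcal{P}_E(Z)\|_F^2 - \langle \mathcal{P}_E(Z), (U(S - \Sigma)V^T) \rangle \\
&\ge \frac{1}{2}\, \|\mathcal{P}_E(Z)\|_F^2 - \sqrt{2r\alpha}\, \|Z^E\|_2\, C n\, d_+(x,u) .
\end{aligned}$$

By taking the difference of these inequalities we get the desired upper bound. □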
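The two facts that drive this proof, the expansion of the squared residual in Eq. (17) and the bound on the cross term for a matrix of rank at most $2r$, can be sanity-checked numerically. The sketch below uses a small synthetic instance; the sizes, the random factors and the sampling mask are illustrative assumptions, and `A` merely plays the role of $XSY^T - M$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2

# A stands in for X S Y^T - M: a difference of two rank-r matrices (rank <= 2r).
A = (rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
     - rng.standard_normal((m, r)) @ rng.standard_normal((r, n)))
Z = rng.standard_normal((m, n))          # noise matrix
mask = rng.random((m, n)) < 0.5          # indicator of the revealed set E

def P_E(B):
    """Projection onto the revealed entries: zero outside E."""
    return np.where(mask, B, 0.0)

def inner(B, C):
    """Frobenius (trace) inner product <B, C>."""
    return float(np.sum(B * C))

# Expansion of (1/2) ||P_E(A - Z)||_F^2 as in Eq. (17).
lhs = 0.5 * np.linalg.norm(P_E(A - Z), "fro") ** 2
rhs = (0.5 * np.linalg.norm(P_E(Z), "fro") ** 2
       + 0.5 * np.linalg.norm(P_E(A), "fro") ** 2
       - inner(P_E(Z), A))

# Cross term: |<P_E(Z), A>| <= sqrt(rank A) * ||P_E(Z)||_2 * ||A||_F.
cross = abs(inner(P_E(Z), A))
cross_bound = np.sqrt(2 * r) * np.linalg.norm(P_E(Z), 2) * np.linalg.norm(A, "fro")
```

The cross-term inequality holds because the inner product is at most the operator norm of one factor times the nuclear norm of the other, and the nuclear norm of a rank-$k$ matrix is at most $\sqrt{k}$ times its Frobenius norm.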


Proof (Corollary 3.1) By putting together Eq. (8) and (9), and using the definitions of $d_+(x,u)$, $d_-(x,u)$, we get

$$\|S - \Sigma\|_F^2 \le \frac{C_1 + C_2}{C_1}\, \Sigma_{\max}^2\, d(x,u)^2 + \frac{(C_1 + C_2)\sqrt{r}}{C_1 \epsilon}\, \|Z^E\|_2\, \sqrt{\Sigma_{\max}^2\, d(x,u)^2 + \|S - \Sigma\|_F^2} .$$

Let $x \equiv \|S - \Sigma\|_F$, $a^2 \equiv ((C_1 + C_2)/C_1)\, \Sigma_{\max}^2\, d(x,u)^2$, and $b \equiv ((C_1 + C_2)\sqrt{r}/C_1\epsilon)\, \|Z^E\|_2$. The above inequality then takes the form

$$x^2 \le a^2 + b\sqrt{x^2 + a^2} \le a^2 + ab + bx ,$$

which implies our claim $x \le a + b$.

The singular value bounds (10) and (11) follow by the triangle inequality. For instance

$$\sigma_{\min}(S) \ge \Sigma_{\min} - C \Sigma_{\max}\, d(x,u) - C \frac{\sqrt{r}}{\epsilon}\, \|Z^E\|_2 ,$$

which implies the inequality (11) for $d(x,u) \le \delta = \Sigma_{\min}/C_0 \Sigma_{\max}$ and $C_0$ large enough. An analogous argument proves Eq. (10). □
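The last implication in the bound on $\|S - \Sigma\|_F$, from the quadratic inequality to $x \le a + b$, is stated without detail. One way to see it, as a short sketch using only $a, b, x \ge 0$:

```latex
\begin{align*}
&\sqrt{x^2 + a^2} \;\le\; x + a
  &&\text{(square both sides; the cross term $2ax$ is non-negative)},\\
&\text{if } x > a + b:\quad
  x^2 - bx \;=\; x\,(x - b) \;>\; (a+b)\,a \;=\; a^2 + ab ,
  &&\text{contradicting } x^2 \le a^2 + ab + bx .
\end{align*}
```

Hence $x \le a + b$, which is the first inequality of the corollary after absorbing the factors $(C_1 + C_2)/C_1$ into the constant $C$.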


3.4 Proof of Lemma 5 


Without loss of generality we will assume $\delta \le 1$, $C_2 \ge 1$ and

$$\frac{\sqrt{r}}{\epsilon}\, \|Z^E\|_2 \le \Sigma_{\min} , \qquad (19)$$

because otherwise the lower bound (12) is trivial for all $d(x,u) \le \delta$.

Denote by $t \mapsto x(t)$, $t \in [0,1]$, the geodesic on $\mathsf{M}(m,n)$ such that $x(0) = u$ and $x(1) = x$, parametrized proportionally to the arclength. Let $\hat{w} = \dot{x}(1)$ be its final velocity, with $\hat{w} = (\hat{W}, \hat{Q})$. Obviously $\hat{w} \in T_x$ (with $T_x$ the tangent space of $\mathsf{M}(m,n)$ at $x$) and

$$\frac{1}{m}\, \|\hat{W}\|_F^2 + \frac{1}{n}\, \|\hat{Q}\|_F^2 = d(x,u)^2 ,$$

because $t \mapsto x(t)$ is parametrized proportionally to the arclength.
Explicit expressions for $\hat{w}$ can be obtained in terms of $w \equiv \dot{x}(0) = (W, Q)$ (Keshavan et al., 2010). If we let $W = L\Theta R^T$ be the singular value decomposition of $W$, we obtain

$$\hat{W} = -U R \Theta \sin\Theta\, R^T + L \Theta \cos\Theta\, R^T . \qquad (20)$$


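It was proved in Keshavan et al. (2010) that $\langle \mathrm{grad}\, G(x), \hat{w} \rangle \ge 0$. It is therefore sufficient to lower bound the scalar product $\langle \mathrm{grad}\, F, \hat{w} \rangle$. By computing the gradient of $F$ we get

$$\begin{aligned}
\langle \mathrm{grad}\, F(x), \hat{w} \rangle &= \langle \mathcal{P}_E(XSY^T - N), (XS\hat{Q}^T + \hat{W}SY^T) \rangle \\
&= \langle \mathcal{P}_E(XSY^T - M), (XS\hat{Q}^T + \hat{W}SY^T) \rangle - \langle \mathcal{P}_E(Z), (XS\hat{Q}^T + \hat{W}SY^T) \rangle \\
&= \langle \mathrm{grad}\, F_0(x), \hat{w} \rangle - \langle \mathcal{P}_E(Z), (XS\hat{Q}^T + \hat{W}SY^T) \rangle \qquad (21)
\end{aligned}$$

where $F_0(x)$ is the cost function in absence of noise, namely

$$F_0(X,Y) = \min_{S \in \mathbb{R}^{r\times r}} \left\{ \frac{1}{2} \sum_{(i,j) \in E} \left( (XSY^T)_{ij} - M_{ij} \right)^2 \right\} . \qquad (22)$$

As proved in Keshavan et al. (2010),

$$\langle \mathrm{grad}\, F_0(x), \hat{w} \rangle \ge C n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2\, d(x,u)^2 \qquad (23)$$

(see Lemma 9 in Appendix).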
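We are therefore left with the task of upper bounding $\langle \mathcal{P}_E(Z), (XS\hat{Q}^T + \hat{W}SY^T) \rangle$. Since $XS\hat{Q}^T$ has rank at most $r$, we have

$$\langle \mathcal{P}_E(Z), XS\hat{Q}^T \rangle \le \sqrt{r}\, \|Z^E\|_2\, \|XS\hat{Q}^T\|_F .$$

Since $X^TX = mI$, we get

$$\begin{aligned}
\|XS\hat{Q}^T\|_F^2 &= m\, \mathrm{Tr}(S^TS\hat{Q}^T\hat{Q}) \le n\alpha\, \sigma_{\max}(S)^2\, \|\hat{Q}\|_F^2 \\
&\le C n^2 \alpha \left( \Sigma_{\max} + \frac{\sqrt{r}}{\epsilon}\, \|Z^E\|_2 \right)^2 d(x,u)^2 \qquad (24) \\
&\le 4C n^2 \alpha\, \Sigma_{\max}^2\, d(x,u)^2 ,
\end{aligned}$$

where, in inequality (24), we used Corollary 3.1 and in the last step, we used Eq. (19). Proceeding analogously for $\langle \mathcal{P}_E(Z), \hat{W}SY^T \rangle$, we get

$$\langle \mathcal{P}_E(Z), (XS\hat{Q}^T + \hat{W}SY^T) \rangle \le C n\, \Sigma_{\max} \sqrt{r\alpha}\, \|Z^E\|_2\, d(x,u) .$$

Together with Eq. (21) and (23) this implies

$$\langle \mathrm{grad}\, F(x), \hat{w} \rangle \ge C_1 n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2\, d(x,u) \left\{ d(x,u) - C_2 \frac{\sqrt{r}\, \Sigma_{\max} \|Z^E\|_2}{\epsilon\, \Sigma_{\min} \Sigma_{\min}} \right\} ,$$

which implies Eq. (12) by the Cauchy-Schwarz inequality.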
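The trace identity and the singular value bound used for $\|XS\hat{Q}^T\|_F^2$ rely only on the normalization $X^TX = mI$. A minimal numerical sketch follows; the matrices are random stand-ins (in particular `Q_hat` is an arbitrary $n \times r$ matrix, not an actual tangent vector), and the QR construction of `X` is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 30, 20, 3

# X with X^T X = m I, matching the manifold normalization.
Qx, _ = np.linalg.qr(rng.standard_normal((m, r)))
X = np.sqrt(m) * Qx
S = rng.standard_normal((r, r))          # arbitrary r x r matrix
Q_hat = rng.standard_normal((n, r))      # stand-in for the component \hat{Q}

# ||X S Q^T||_F^2 = m Tr(S^T S Q^T Q) <= m sigma_max(S)^2 ||Q||_F^2.
lhs = np.linalg.norm(X @ S @ Q_hat.T, "fro") ** 2
trace_form = m * np.trace(S.T @ S @ Q_hat.T @ Q_hat)
upper = m * np.linalg.norm(S, 2) ** 2 * np.linalg.norm(Q_hat, "fro") ** 2
```

With $m = n\alpha$ the upper bound reads $n\alpha\, \sigma_{\max}(S)^2 \|\hat{Q}\|_F^2$, which is the form that appears in the first line of the display containing inequality (24).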


4. Proof of Theorem 1.3 


Proof (Independent entries model) We start with a claim that for any sampling set $E$, we have

$$\|\widetilde{Z}^E\|_2 \le \|Z^E\|_2 .$$

To prove this claim, let $x^*$ and $y^*$ be $m$ and $n$ dimensional vectors, respectively, achieving the optimum in $\max_{\|x\|\le 1, \|y\|\le 1}\{x^T \widetilde{Z}^E y\}$, that is, such that $\|\widetilde{Z}^E\|_2 = x^{*T} \widetilde{Z}^E y^*$. Recall that, as a result of the trimming step, all the entries in trimmed rows and columns of $\widetilde{Z}^E$ are set to zero. Then, there is no gain in maximizing $x^T \widetilde{Z}^E y$ to have a non-zero entry $x^*_i$ for $i$ corresponding to the rows which are trimmed. Analogously, for $j$ corresponding to the trimmed columns, we can assume without loss of generality that $y^*_j = 0$. From this observation, it follows that $x^{*T} \widetilde{Z}^E y^* = x^{*T} Z^E y^*$, since the trimmed matrix $\widetilde{Z}^E$ and the sample noise matrix $Z^E$ only differ in the trimmed rows and columns. The claim follows from the fact that $x^{*T} Z^E y^* \le \|Z^E\|_2$, for any $x^*$ and $y^*$ with unit norm.
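The claim can be illustrated numerically: zeroing out whole rows and columns never increases the operator norm. The sketch below assumes a trimming rule that removes rows with more than $2|E|/m$ revealed entries and columns with more than $2|E|/n$ revealed entries; this rule, the sampling probability `p` and the Gaussian noise are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 40, 30
p = 0.1                                   # hypothetical sampling probability

mask = rng.random((m, n)) < p             # revealed set E
Z_E = np.where(mask, rng.standard_normal((m, n)), 0.0)

# Assumed trimming rule: drop over-represented rows and columns.
E_size = int(mask.sum())
heavy_rows = mask.sum(axis=1) > 2 * E_size / m
heavy_cols = mask.sum(axis=0) > 2 * E_size / n

Z_trim = Z_E.copy()
Z_trim[heavy_rows, :] = 0.0
Z_trim[:, heavy_cols] = 0.0

norm_full = np.linalg.norm(Z_E, 2)        # ||Z^E||_2
norm_trim = np.linalg.norm(Z_trim, 2)     # ||Z~^E||_2
```

The inequality `norm_trim <= norm_full` holds for any choice of rows and columns to zero out, which is why the argument in the text does not depend on the details of the trimming thresholds.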

In what follows, we will first prove that $\|Z^E\|_2$ is bounded by the right-hand side of Eq. (4) for any range of $|E|$. Due to the above observation, this implies that $\|\widetilde{Z}^E\|_2$ is also bounded by $C\sigma\sqrt{\epsilon\sqrt{\alpha}\log n}$, where $\epsilon \equiv |E|/\sqrt{\alpha}\, n$. Further, we use the same analysis to prove a tighter bound in Eq. (5) when $|E| \ge n\log n$.

First, we want to show that $\|Z^E\|_2$ is bounded by $C\sigma\sqrt{\epsilon\sqrt{\alpha}\log n}$ when the $Z_{ij}$'s are i.i.d. random variables with zero mean and sub-Gaussian tail with parameter $\sigma^2$. The proof strategy is to show that $E[\|Z^E\|_2]$ is bounded, using the result of Seginer (2000) on expected norm of random matrices, and use the fact that $\|\cdot\|_2$ is a Lipschitz continuous function of its arguments together with concentration inequality for Lipschitz functions on i.i.d. Gaussian random variables due to Talagrand (1996).

Note that $\|\cdot\|_2$ is a Lipschitz function with Lipschitz constant 1. Indeed, for any $M$ and $M'$, $\big|\, \|M'\|_2 - \|M\|_2 \big| \le \|M' - M\|_2 \le \|M' - M\|_F$, where the first inequality follows from the triangle inequality and the second inequality follows from the fact that $\|\cdot\|_F^2$ is the sum of the squared singular values.

To bound the probability of large deviation, we use the result on concentration inequality for Lipschitz functions on i.i.d. sub-Gaussian random variables due to Talagrand (1996). For a 1-Lipschitz function $\|\cdot\|_2$ on $m \times n$ i.i.d. random variables $Z^E_{ij}$ with zero mean, and sub-Gaussian tails with parameter $\sigma^2$,

$$P\left( \|Z^E\|_2 - E[\|Z^E\|_2] > t \right) \le \exp\left\{ -\frac{t^2}{2\sigma^2} \right\} . \qquad (25)$$

Setting $t = \sqrt{8\sigma^2 \log n}$, this implies that $\|Z^E\|_2 \le E[\|Z^E\|_2] + \sqrt{8\sigma^2 \log n}$ with probability larger than $1 - 1/n^4$.

Now, we are left to bound the expectation $E[\|Z^E\|_2]$. First, we symmetrize the possibly asymmetric random variables $Z^E_{ij}$ to use the result of Seginer (2000) on expected norm of random matrices with symmetric random variables. Let the $Z'^E_{ij}$'s be independent copies of the $Z^E_{ij}$'s, and the $\xi_{ij}$'s be independent Bernoulli random variables such that $\xi_{ij} = +1$ with probability 1/2 and $\xi_{ij} = -1$ with probability 1/2. Then, by convexity of $E[\|Z^E - Z'^E\|_2 \mid Z^E]$ and Jensen's inequality,

$$E[\|Z^E\|_2] \le E[\|Z^E - Z'^E\|_2] = E[\|(\xi_{ij}(Z^E_{ij} - Z'^E_{ij}))\|_2] \le 2E[\|(\xi_{ij} Z^E_{ij})\|_2] ,$$

where $(\xi_{ij} Z^E_{ij})$ denotes an $m \times n$ matrix with entry $\xi_{ij} Z^E_{ij}$ in position $(i,j)$. Thus, it is enough to show that $E[\|Z^E\|_2]$ is bounded by $C\sigma\sqrt{\epsilon\sqrt{\alpha}\log n}$ in the case of symmetric random variables $Z_{ij}$'s.
To this end, we apply the following bound on expected norm of random matrices with i.i.d. symmetric random entries, proved by Seginer (2000, Theorem 1.1).

$$E[\|Z^E\|_2] \le C \left( E\Big[ \max_{i \in [m]} \|Z^E_{i\bullet}\| \Big] + E\Big[ \max_{j \in [n]} \|Z^E_{\bullet j}\| \Big] \right) , \qquad (26)$$

where $Z^E_{i\bullet}$ and $Z^E_{\bullet j}$ denote the $i$th row and $j$th column of $Z^E$ respectively. For any positive parameter $\beta$, which will be specified later, the following is true.

$$E\Big[ \max_j \|Z^E_{\bullet j}\|^2 \Big] \le \beta\sigma^2 \epsilon\sqrt{\alpha} + \int_0^\infty P\Big( \max_j \|Z^E_{\bullet j}\|^2 \ge \beta\sigma^2 \epsilon\sqrt{\alpha} + z \Big)\, dz . \qquad (27)$$














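To bound the second term, we can apply union bound on each of the $n$ columns, and use the following bound on each column $\|Z^E_{\bullet j}\|^2$ resulting from concentration of measure inequality for the i.i.d. sub-Gaussian random matrix $Z$.

$$P\left( \sum_{k=1}^m (Z^E_{kj})^2 \ge \beta\sigma^2 \epsilon\sqrt{\alpha} + z \right) \le \exp\left\{ -\frac{3}{8} \left( (\beta - 3)\epsilon\sqrt{\alpha} + \frac{z}{\sigma^2} \right) \right\} . \qquad (28)$$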
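To prove the above result, we apply Chernoff bound on the sum of independent random variables. Recall that $Z^E_{kj} = \tilde{\xi}_{kj} Z_{kj}$ where the $\tilde{\xi}$'s are independent Bernoulli random variables such that $\tilde{\xi} = 1$ with probability $\epsilon/\sqrt{mn}$ and zero with probability $1 - \epsilon/\sqrt{mn}$. Then, for the choice of $\lambda = 3/8\sigma^2 < 1/2\sigma^2$,

$$\begin{aligned}
E\Big[ \exp\Big( \lambda \sum_{k=1}^m (\tilde{\xi}_{kj} Z_{kj})^2 \Big) \Big] &= \Big( 1 - \frac{\epsilon}{\sqrt{mn}} + \frac{\epsilon}{\sqrt{mn}}\, E\big[ e^{\lambda Z_{kj}^2} \big] \Big)^m \\
&\le \Big( 1 - \frac{\epsilon}{\sqrt{mn}} + \frac{\epsilon}{\sqrt{mn(1 - 2\sigma^2\lambda)}} \Big)^m \\
&= \exp\Big\{ m \log\Big( 1 + \frac{\epsilon}{\sqrt{mn}} \Big) \Big\} \\
&\le \exp\{ \epsilon\sqrt{\alpha} \} ,
\end{aligned}$$

where the first inequality follows from the definition of $Z_{kj}$ as a zero mean random variable with sub-Gaussian tail, and the second inequality follows from $\log(1 + x) \le x$. By applying Chernoff bound, Eq. (28) follows. Note that an analogous result holds for the Euclidean norm on the rows $\|Z^E_{i\bullet}\|^2$.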
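The final Chernoff step is left implicit above. As a sketch, taking the moment bound $E[\exp(\lambda \sum_k (Z^E_{kj})^2)] \le \exp\{\epsilon\sqrt{\alpha}\}$ for $\lambda = 3/8\sigma^2$ as given, the exponential Markov inequality yields:

```latex
\begin{align*}
P\Big( \sum_{k=1}^m (Z^E_{kj})^2 \ge \beta\sigma^2\epsilon\sqrt{\alpha} + z \Big)
  &\le e^{-\lambda(\beta\sigma^2\epsilon\sqrt{\alpha} + z)}\;
       E\Big[ \exp\Big( \lambda \sum_{k=1}^m (Z^E_{kj})^2 \Big) \Big] \\
  &\le \exp\Big\{ -\tfrac{3}{8}\Big( \beta\,\epsilon\sqrt{\alpha} + \frac{z}{\sigma^2} \Big) + \epsilon\sqrt{\alpha} \Big\} \\
  &= \exp\Big\{ -\tfrac{3}{8}\Big( \big(\beta - \tfrac{8}{3}\big)\epsilon\sqrt{\alpha} + \frac{z}{\sigma^2} \Big) \Big\} \\
  &\le \exp\Big\{ -\tfrac{3}{8}\Big( (\beta - 3)\,\epsilon\sqrt{\alpha} + \frac{z}{\sigma^2} \Big) \Big\} .
\end{align*}
```

The last line only loosens the constant, replacing $8/3$ by $3$, so the resulting tail bound is slightly weaker than what the computation gives but has a cleaner form.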

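Substituting Eq. (28) and $P\big( \max_j \|Z^E_{\bullet j}\|^2 \ge z \big) \le m\, P\big( \|Z^E_{\bullet j}\|^2 \ge z \big)$ in Eq. (27), we get

$$E\Big[ \max_j \|Z^E_{\bullet j}\|^2 \Big] \le \beta\sigma^2 \epsilon\sqrt{\alpha} + \frac{8\sigma^2 m}{3}\, e^{-(\beta - 3)\frac{3}{8}\epsilon\sqrt{\alpha}} . \qquad (29)$$

The second term can be made arbitrarily small by taking $\beta = C\log n$ with large enough $C$. Since $E\big[ \max_j \|Z^E_{\bullet j}\| \big] \le \sqrt{E\big[ \max_j \|Z^E_{\bullet j}\|^2 \big]}$, applying Eq. (29) with $\beta = C\log n$ in Eq. (26) gives

$$E[\|Z^E\|_2] \le C\sigma\sqrt{\epsilon\sqrt{\alpha}\log n} .$$

Together with Eq. (25), this proves the desired thesis for any sample size $|E|$.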
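Two ingredients of the argument above are easy to check numerically: the 1-Lipschitz property of the operator norm with respect to the Frobenius norm, and the arithmetic that turns the deviation level $t = \sqrt{8\sigma^2\log n}$ in Eq. (25) into a failure probability of $n^{-4}$. The sketch below uses arbitrary illustrative values for the matrices, `sigma` and `n_big`.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
m, n = 25, 20

# 1-Lipschitz property: | ||M'||_2 - ||M||_2 | <= ||M' - M||_2 <= ||M' - M||_F.
M1 = rng.standard_normal((m, n))
M2 = M1 + 0.3 * rng.standard_normal((m, n))
gap = abs(np.linalg.norm(M2, 2) - np.linalg.norm(M1, 2))
op_diff = np.linalg.norm(M2 - M1, 2)
fro_diff = np.linalg.norm(M2 - M1, "fro")

# Deviation level: t = sqrt(8 sigma^2 log n) gives exp(-t^2 / (2 sigma^2)) = n^(-4).
sigma, n_big = 1.7, 1000                  # hypothetical values
t = math.sqrt(8 * sigma**2 * math.log(n_big))
tail = math.exp(-t**2 / (2 * sigma**2))
```

The value of `tail` does not depend on `sigma`, since the variance cancels in the exponent; only the dimension enters through $\log n$.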

In the case when $|E| \ge n\log n$, we can get a tighter bound by similar analysis. Since $\epsilon \ge C'\log n$, for some constant $C'$, the second term in Eq. (29) can be made arbitrarily small with a large constant $\beta$. Hence, applying Eq. (29) with $\beta = C$ in Eq. (26), we get

$$E[\|Z^E\|_2] \le C\sigma\sqrt{\epsilon\sqrt{\alpha}} .$$

Together with Eq. (25), this proves the desired thesis for $|E| \ge n\log n$.

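Proof (Worst Case Model) Let $D$ be the $m \times n$ all-ones matrix. Then for any matrix $Z$ from the worst case model, we have $\|\widetilde{Z}^E\|_2 \le Z_{\max} \|\widetilde{D}^E\|_2$, since $x^T \widetilde{Z}^E y \le \sum_{i,j} Z_{\max} |x_i|\, \widetilde{D}^E_{ij}\, |y_j|$, which follows from the fact that the $Z_{ij}$'s are uniformly bounded. Further, $\widetilde{D}^E$ is an adjacency matrix of a corresponding bipartite graph with bounded degrees. Then, for any choice of $E$ the following is true for all positive integers $k$:

$$\|\widetilde{D}^E\|_2^{2k} \le \max_{x, \|x\| = 1} \left| x^T \big( (\widetilde{D}^E)^T \widetilde{D}^E \big)^k x \right| \le \mathrm{Tr}\Big( \big( (\widetilde{D}^E)^T \widetilde{D}^E \big)^k \Big) .$$

Now $\mathrm{Tr}\big( ( (\widetilde{D}^E)^T \widetilde{D}^E )^k \big)$ is the number of paths of length $2k$ on the bipartite graph with adjacency matrix $\widetilde{D}^E$, that begin and end at $i$ for every $i \in [n]$. Since this graph has degree bounded by $2\epsilon$, we get

$$\|\widetilde{D}^E\|_2^{2k} \le n(2\epsilon)^{2k} .$$

Taking the $2k$-th root and letting $k$ grow large, we get $\|\widetilde{D}^E\|_2 \le 2\epsilon$, which is the desired thesis. □

Appendix A. Three Lemmas on the Noiseless Problem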
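Lemma 7 There exist numerical constants $C_0, C_1, C_2$ such that the following happens. Assume $\epsilon \ge C_0 \mu_0 r \sqrt{\alpha}\, \max\{\log n;\ \mu_0 r \sqrt{\alpha}\,(\Sigma_{\min}/\Sigma_{\max})^4\}$ and $\delta \le \Sigma_{\min}/(C_0 \Sigma_{\max})$. Then,

$$C_1 \sqrt{\alpha}\, \Sigma_{\min}^2\, d(x,u)^2 + C_1 \sqrt{\alpha}\, \|S_0 - \Sigma\|_F^2 \le \frac{1}{n\epsilon}\, F_0(x) \le C_2 \sqrt{\alpha}\, \Sigma_{\max}^2\, d(x,u)^2 ,$$

for all $x \in \mathsf{M}(m,n) \cap \mathcal{K}(4\mu_0)$ such that $d(x,u) \le \delta$, with probability at least $1 - 1/n^4$. Here $S_0 \in \mathbb{R}^{r\times r}$ is the matrix realizing the minimum in Eq. (22).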


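Lemma 8 There exist numerical constants $C_0$ and $C$ such that the following happens. Assume $\epsilon \ge C_0 \mu_0 r \sqrt{\alpha}\, (\Sigma_{\max}/\Sigma_{\min})^2 \max\{\log n;\ \mu_0 r \sqrt{\alpha}\,(\Sigma_{\max}/\Sigma_{\min})^4\}$ and $\delta \le \Sigma_{\min}/(C_0 \Sigma_{\max})$. Then

$$\|\mathrm{grad}\, \widetilde{F}_0(x)\|^2 \ge C n \epsilon^2\, \Sigma_{\min}^4\, d(x,u)^2 ,$$

for all $x \in \mathsf{M}(m,n) \cap \mathcal{K}(4\mu_0)$ such that $d(x,u) \le \delta$, with probability at least $1 - 1/n^4$.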


Lemma 9 Define $\hat{w}$ as in Eq. (20). Then there exist numerical constants $C_0$ and $C$ such that the following happens. Under the hypothesis of Lemma 8,

$$\langle \mathrm{grad}\, F_0(x), \hat{w} \rangle \ge C n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2\, d(x,u)^2 ,$$

for all $x \in \mathsf{M}(m,n) \cap \mathcal{K}(4\mu_0)$ such that $d(x,u) \le \delta$, with probability at least $1 - 1/n^4$.